SPAR-Telos/interp

A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

Raghu Arghal, Fade Chen, Niall Dalton, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan, Gabriele Sarti, Mario Giulianelli

Abstract: Understanding whether and how language model agents pursue goals is essential for ensuring the safety of AI systems deployed to act autonomously in the world. In this work, we study goal-directedness in a language model agent, GPT-OSS-20B, as it navigates procedurally generated 2D grid environments. We operationalize goal-directedness behaviourally--through the optimality of an agent's actions and through its robustness to environment perturbations--and representationally--by probing the agent's internal activations for evidence of structured spatial knowledge. Our behavioural evaluation reveals that GPT-OSS-20B generally acts as a goal-directed agent, navigating towards the goal across a range of grid sizes with above-chance optimality. Representationally, linear and MLP probes trained on the agent's residual stream activations at intermediate layers uncover internal representations that partially encode the spatial layout of the environment, including the positions of walls, the goal, and the agent itself. Taken together, our results indicate that GPT-OSS-20B can act as a goal-directed agent through reliance on internal representations that partially but non-trivially encode the spatial features of its environment.

Paper: arxiv.org/abs/2602.08964

Data and trained probes: huggingface.co/project-telos

Citation

```bibtex
@article{arghal-etal-2026-behavioural,
    title={A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents},
    author={Raghu Arghal and Fade Chen and Niall Dalton and Evgenii Kortukov and Calum McNamara and Angelos Nalmpantis and Moksh Nirvaan and Gabriele Sarti and Mario Giulianelli},
    year={2026},
    journal={arXiv preprint arXiv:2602.08964},
    url={https://arxiv.org/abs/2602.08964}
}
```

Installation

```shell
# Clone the repository
git clone https://github.com/SPAR-Telos/interp
cd interp

# Install with uv (recommended)
uv sync

# Or install with pip
pip install -e .

# For vLLM-based activation extraction (requires GPU)
pip install -e ".[vllm]"
```

Data and Trained Probes

Pre-computed trajectories, activations, trained probes, and evaluation results are available on the project-telos HuggingFace organization.

Trained probes:

| Repository | Description |
| --- | --- |
| cognitive_map_probes | Trained cell identity classification probes |
| distance_probes | Trained A* distance regression probes |
| decoder_probes | Trained decoder probes |

Datasets:

| Repository | Description |
| --- | --- |
| trajectories_train_single_step | Training trajectories (single step) |
| trajectories_test_full | Test trajectories (full episodes) |
| trajectories_test_full_with_cognitive_map_probes | Test trajectories with probe predictions |
| trajectories_key_door_100 | Key-door environment trajectories |
| trajectories_key_no_door_100 | Key-no-door environment trajectories |
| probes_train_single_step | Prepared probe training data |
| activations_test_full | Extracted test activations |
| activations_key_door_env_100 | Key-door environment activations |
| cognitive_map_probes_results | Cell identity probe evaluation results |
| distance_probes_results | Distance probe evaluation results |

Reproduction

The analysis pipeline has four stages. Each stage uses a CLI command provided by the interp-cli tool. See telos_interp/commands/README.md for full documentation of all commands and options.

To use pre-computed data, download the relevant datasets from the HuggingFace organization above and point the CLI commands to the downloaded directories.

1. Gather activations

Extract model activations from trajectory JSON files:

```shell
interp-cli gather_activations \
    --trajectory-paths "data/trajectories/size5/*.json" \
    --output-dir data/activations/size5 \
    --layers all \
    --steps 0 \
    --output-indices -1
```
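The `--output-indices -1` flag appears to select which generated-token positions to keep activations for, with `-1` denoting the last position. A minimal sketch of that indexing, assuming one activation vector is stored per generated token (`select_output_activations` is a hypothetical helper, not part of `interp-cli`):

```python
# Hypothetical illustration of output-index selection.
# Assumes activations arrive as one vector per generated token.
def select_output_activations(per_token_activations, output_indices):
    """Pick activation vectors at the given token positions.

    Negative indices count from the end, so -1 selects the
    activation at the final generated token.
    """
    return [per_token_activations[i] for i in output_indices]

# Toy "activations": one 2-d vector per generated token.
acts = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
print(select_output_activations(acts, [-1]))  # [[0.5, 0.6]]
```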

2. Prepare activations for probing

Format extracted activations into datasets suitable for probe training:

```shell
interp-cli prepare_activations_for_probing \
    --activations-dir data/activations/size5 \
    --trajectories-dir data/trajectories/size5 \
    --probe-type grid_tile \
    --output-indices -1 \
    --balance-classes-per-trajectory
```
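The `--balance-classes-per-trajectory` flag suggests that, within each trajectory, every cell-identity class is downsampled to the rarest class's count so the probe cannot exploit class frequency. A sketch under that assumption (`balance_classes` is a hypothetical helper, not the repository's implementation):

```python
import random
from collections import defaultdict

def balance_classes(examples, seed=0):
    """Downsample so every class label appears equally often.

    `examples` is a list of (activation, label) pairs from one
    trajectory; each class is cut to the rarest class's count.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for act, label in examples:
        by_label[label].append((act, label))
    n = min(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, n))
    return balanced

# 5 "wall" examples vs. 2 "goal" examples -> 2 of each survive.
examples = [("a", "wall")] * 5 + [("b", "goal")] * 2
print(len(balance_classes(examples)))  # 4
```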

3. Train probes

Train cell identity classifiers or distance regression probes:

```shell
# Cell identity probe
interp-cli train_cognitive_map_probe \
    --train-data-path data/activations/size5/cognitive_map_activations_*.pt \
    --model-type mlp \
    --hidden-dims "512,256" \
    --num-epochs 100

# Distance regression probe
interp-cli train_distance_probe \
    --train-data-path data/activations/size7/distance_activations_*.pt \
    --model-type mlp \
    --hidden-dims "512,256" \
    --num-epochs 100
```

Example configuration files are provided in configs/.
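The `--hidden-dims "512,256"` spec presumably defines the MLP probe's hidden-layer widths between the residual-stream input and the output layer. A sketch of how such a spec could expand into layer shapes (`probe_layer_sizes` is hypothetical, and the 2880-wide input is only an assumed residual-stream width):

```python
def probe_layer_sizes(input_dim, hidden_dims_spec, output_dim):
    """Expand a CLI spec like "512,256" into the full chain of
    (in, out) shapes for an MLP probe's linear layers."""
    hidden = [int(h) for h in hidden_dims_spec.split(",") if h]
    dims = [input_dim] + hidden + [output_dim]
    return list(zip(dims[:-1], dims[1:]))

# e.g. a probe from an assumed 2880-wide residual stream
# down to 4 cell-identity classes:
print(probe_layer_sizes(2880, "512,256", 4))
# [(2880, 512), (512, 256), (256, 4)]
```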

4. Evaluate and apply probes

Evaluate probes on held-out data and apply them to generate trajectory-level predictions:

```shell
# Evaluate cell identity probe
interp-cli eval_cognitive_map_probe \
    --trajectories-dir data/trajectories/size5_test \
    --activations-dir data/activations/size5_test \
    --probe-path path/to/cognitive_map_probe.pt \
    --output-indices -1

# Apply probe to trajectories
interp-cli apply_cognitive_map_probe \
    --activations-dir data/activations/size5 \
    --trajectories-dir data/trajectories/size5 \
    --probe-path path/to/cognitive_map_probe.pt \
    --output-dir data/trajectories_with_probes/size5 \
    --layers 20 \
    --steps all \
    --output-indices -1
```
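At its core, evaluating a cell identity probe amounts to comparing predicted and true cell labels across the grid. A minimal sketch of per-cell accuracy, using a hypothetical flat-list grid representation rather than the repository's actual data format:

```python
def cell_accuracy(predicted, actual):
    """Fraction of grid cells whose predicted identity
    (e.g. wall / empty / goal / agent) matches the truth."""
    assert len(predicted) == len(actual)
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Toy 2x3 grid flattened row-major; one cell mispredicted.
true_grid = ["wall", "empty", "goal", "empty", "agent", "wall"]
pred_grid = ["wall", "empty", "goal", "wall", "agent", "wall"]
print(cell_accuracy(pred_grid, true_grid))  # 5/6 ≈ 0.833
```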

Development

See CONTRIBUTING.md for development setup, code style, and testing instructions.

License

This project is licensed under the Apache License 2.0. See LICENSE for details.

